Skip to main content

Match Identity

Match Studio

Dashboard

dash.png

The dashboard is the first thing you will see when you start Match Studio. You can return here at any time by selecting Match Studio or the Babel Street logo on the left side of the navigation bar. The dashboard contains the following options:

  • Explore Search: Search an existing index, or upload a new one.

    • Use our Data: Brings you to the Search page, where you can select an index and perform a single or batch search on it.

      Note

      When using Match Studio in locked mode, this option is disabled until at least one index has been added.

    • Use your Data: Launches the Import Index wizard, where you can upload your own structured data and turn it into a searchable index.

  • Test and Compare: Brings you to the Compare page, where you can compare names, dates, and addresses, view their match scores, learn about how the match was determined, and test various parameters.

  • Customize: Control how Match Studio handles your searches by adding overrides and stop words, adjusting match parameters, and performing evaluations to find out which parameter values you should use.

    Note

    If you are using Match Studio in locked mode, only parameters will be available.

  • Documentation: Click on the question mark in the navigation bar to access our online documentation.

  • Guided Tours: Click on the lightbulb in the lower left-hand corner of the product to start a step-by-step tour of the features of Match Studio.

OFAC List

Match Studio comes preloaded with an index in the form of the OFAC list, a data file obtained from the Office of Foreign Assets Control. The list contains Latin-script names from many countries. This index is used for our Guided Tours. It's also available for you to experiment with and learn more about the features of Match Studio.

To perform a search on the OFAC list, navigate to the Search page, either by selecting Search from the navigation bar at the top of the page, or by selecting Search an Index from the dashboard.  Then, select the OFAC list from the index dropdown menu. Since Match Studio is multilingual, you can enter a name to search in any script; the matches will all be Latin-script.

The index is only created on the bundled instance of Elasticsearch. It isn't uploaded if you are using your own Elasticsearch instance, for instance in locked mode.

Compare compare.png

Compare two values of the same field type and view the match score, along with details about how the match score was determined.

The comparison can be between a search value and a matched value or two new terms.

  • View match scores and understand how the score was determined.

  • Modify the settings and see the resulting change in match scores.

There are two ways to access the Compare section:

  • Click on Compare from the navigation bar.

  • Hover over the name and select the compare icon compareicon.png in a search result. This method allows you to automatically compare a searched name with a retrieved name.

To compare two values:

  1. Enter two values to compare.

    Note

    The values might be the search value with a matched value, or two new values. These will be pre-filled for you when navigating from the search results list.

  2. (Optional) If comparing names (person or organization), select the language of each value from the drop-down list. Adding the source languages may improve the match scores.

  3. (Optional) Expand the Show Configurations panel to see or modify parameter configurations.

  4. Compare.

Compare names

When comparing two names, a match score will be displayed below the names being compared, along with a table including additional data about the names such as language, script, entity type, and what the names look like after normalization.

Below the match score, you will find the score matrix, a tabular representation of component scores and match types. For more details on how name matching works, see Understanding name match scores.

For Name fields, the score matrix includes:

  • Tokens identified

  • Match phenomenon for the token pair

  • Raw score (score for the two tokens in a vacuum).

  • Context score (score with weight, token position, and token type considered).

matchscoredisplay.png
 

The tokens from Name 1 are listed down the first column, while the tokens from Name 2 are along the top row. The shaded boxes highlight the token pairs selected during matching that produce the best score. A token pair is a token from Name 1 and its matching token from Name 2. The matchtype column lists the match phenomenon for each token pair. The context score takes into consideration the placement of the token in the score calculation. A penalty is applied if the tokens are out of order. When the tokens line up on the diagonal, they are all in order.

Adjusting parameter values

Beneath the computation score matrix, the Parameter Test tab allows you to test different values for the most common parameters and visualize the results in graph or table form. Clicking the vertical ellipsis icon at the top-right of the Parameter Test panel opens a menu that allows you to reset to default values, apply the current Parameter Test values to the Advanced Configuration section above, or change the Parameter Test values to match the Advanced Configuration section.

Select a parameter, adjust it as desired, and click Run Test to see how changing that parameter will affect the match score. In the Test Results graph to the right, the orange dot represents the score at the current value, while the blue dot represents the score at the default value. The green shaded area represents scores above the match threshold. Hover over any part of the graph to show the match score at that parameter value, and click to apply that value to the clipboard.

Detailed match score information

Tokens tab displays additional information about the weight of each token. Weight determines how important the token pair match is in calculating the final score. For example, unusual tokens get a higher weighting than common names because it is more significant when they match and initials are weighted less than full names. The bin value reflects how "unusual" a token is; the lower the bin, the higher the weight.

The Match Pairs tab displays a simplified list of token pairs along with their final score and match phenomena. This tab also displays whether a pair was used in the match (pairs are not used in a match if their score is too low).

The Final Calculation tab provides detailed information on the process of determining the match scores. This includes a breakdown of the parameters which played a part in determining the score, and what their exact impact on the score was.

Compare names configuration

Match Studio and Babel Street Match are turned to perform well in a variety of name matching scenarios. However, every use case uses different data with distinct match requirements. In Match Studio, you can easily modify and tune match parameters, seeing how each value impacts the final match score in real-time, to improve the accuracy of matches.

To view and modify these settings, click Show Configurations above the match score.

  • In the Parameter Configuration tab, you can select a defined parameter configuration from the drop-down menu. With this method, you will not be able to manually change the parameter values.

  • In the Advanced Configuration tab, you can manually change the value of the most common parameters. The default parameter values are based on the entity type (person, organization, or location) and the language. If you want to undo a change you made to a parameter, click Reset in the Modified column.

    Note

    To save a set of parameter values, click on the copy url icon copy_url.png in the upper right corner. The url contains the parameter values. You must run the Compare before copying the parameter values.

When you are finished editing the configuration, click Apply and Compare to see how your changes impact the match score of the names being compared.

Match phenomena

Match phenomena describe why token spans did or did not match. For example, some match phenomena, such as HMM_MATCH, occur when tokens are matched by a particular scorer. Others, such as DELETION, occur when tokens cannot be matched at all.

Name

Description

Example

CONFLICT

The tokens do not match.

When comparing "William Omega Stephens" and "William Kappa Stephens", "Omega" and "Kappa" are a CONFLICT.

DELETION

The token is unmatched.

When comparing "Richard William Smith" with "Richard Smith", "william" would be considered a DELETION.

EMBEDDING_MATCH

The tokens are semantically similar as determined by word-embedding vectors.

When comparing "boston building company" and "boston construction company", "building" and "construction" are an EMBEDDING_MATCH.

FIELD_BLOCKED

This field cannot be matched because of a cross-field match involving the same field in the other name.

When comparing "Bob|William|Smith" with "William||Smith", "bob" is a FIELD_BLOCKED since the cross-field william match prevents it from matching with its corresponding field.

FIELD_CONFLICT

When comparing two names that are divided into fields, these fields do not match.

When comparing "Richard|William|Smith" with "Richard|Johnson|Smith", "william" and "johnson" would be considered a FIELD_CONFLICT.

FIELD_DELETION

When comparing two names that are divided into fields, this field is unmatched.

When comparing "Richard|Xi|Smith" with "Richard||Smith", "xi" would be considered a FIELD_DELETION.

GIVEN_NAME_DELETION

When comparing two names that are divided into fields, the GIVEN_NAME field is unmatched.

When comparing "Richard|William|Smith" and "||William|Scott", "Richard" will be a GIVEN_NAME_DELETION if that field in both names is marked as a Given_name field.

HANI_ABBREVIATION

One Hani token appears to be an abbreviation of another Hani token.

"北京大学" and "北大" are a HANI_ABBREVIATION match.

HMM_MATCH

The tokens are similar but not identical, and the match was determined by a particular model (hidden Markov model). This is a type of fuzzy match.

"richard" and "richerd" are an HMM_MATCH.

INITIALISM

One token is a name and the other token is the initials of the words which make up the name.

"john fitzgerald kennedy" and "JFK" are an INITIALISM.

"consumer value stores" and "CVS" are an INITIALISM.

INITIAL_MATCH

One token is the first initial of the other.

"w" and "william" are an INITIAL_MATCH.

LANGUAGE_SPECIFIC_MATCH

The match was determined by a language-specific matcher.

"laden" and "لادن" are a LANGUAGE_SPECIFIC_MATCH.

MATCH

The tokens are identical (after stop word elimination and normalization).

"john" and "john" are a MATCH.

NULL

The NULL phenomenon is only listed in this table for completeness. It is only used internally and will never be returned in the SpanMatch object.

N/A

OUT_OF_ORDER_DELETION

This unmatched token still leaves the remaining tokens out of order when it is removed.

When comparing "George Herbert Walker Bush" with "George Bush Walker", "herbert" would be considered an OUT_OF_ORDER_DELETION.

OVERRIDE

The tokens appear as a pair on the override list. This is often used for nicknames.

"john" and "jack" will be an OVERRIDE match if they appear as a pair on the override list.

PREFIX_INITIAL

One token is an initial that matches a prefix in the other token.

In practice, the PREFIX_INITIAL phenomenon is rare.

If the initialsScore parameter is set to 0.1, "E Silva" and "EduardoSil" will be a PREFIX_INITIAL match.

STRING_SIMILARITY

The tokens are similar in string edit distance (number of insertions, deletions, and substitutions) but not similar enough to be a fuzzy match.

"akcd" and "xkcd" are a STRING_SIMILARITY match.

STUCK_INITIAL

One name appears to have an initial mistakenly attached to a preceding token.

"DavidK" and "David Keith" are a STUCK_INITIAL match.

SURNAME_DELETION

When comparing two names that are divided into fields, the SURNAME field is unmatched.

When comparing "Richard|William|Smith" and "Richard|William||", "Smith" will be a SURNAME_DELETION if that field in both names is marked as a Surname field.

TRAILING_PATRONYMIC_DELETION[a]

The unmatched token is a patronymic which has been truncated in the other name.

When comparing "Faisal bin Fahd bin Abdullah" and "Faisal bin Fahd", "bin Abdullah" is considered a TRAILING_PATRONYMIC_DELETION.

TRUNCATED_EXACT_MATCH

The tokens are identical except that one has been slightly truncated.

"murgatroyd" and "murgatroy" are a TRUNCATED_EXACT_MATCH.

TRUNCATED_HMM_MATCH

The tokens are similar, but not identical, and one has been slightly truncated.

"gilpatrickz" and "gillpatrick" are a TRUNCATED_HMM_MATCH.

UNKNOWN_FIELD_MATCH

One of the tokens is part of an "unknown" field in a fielded name.

The UNKNOWN_FIELD_MATCH phenomenon is rare and usually requires use of the Java API.

When comparing "Richard|William|Smith" with "Richard|William|Scott", if the first field is an "unknown" field, "richard" and "richard" would be considered an UNKNOWN_FIELD_MATCH.

[a] Only applies to Latin script names of Arabic origin.

Compare dates

Match calculates the match score based on five different components:

  • Time Proximity: The number of days between Date 1 and Date 2.

  • Year: The difference of the year fields of Date 1 and Date 2.

  • Month: The difference of the month fields of Date 1 and Date 2.

  • Day: The difference of the day fields of Date 1 and Date 2. 1 and 30 are far apart in value, even if they are close in time.

  • String: The string distance is calculated by converting Date 1 and Date 2 to a standard format. The score is calculated on the edit distance between the two strings.

date_match.png
Supported date formats

Match supports a wide variety of date formats. 

  • Days can be represented by 1 or 2 digits. Alphanumerics, such as 2nd or 1st, are not supported.

  • Months can be numerics (1 or 2 digits) or English characters (full name or 3 character abbreviation).

  • Years can be represented by 1, 2, 3 or 4 digits.

  • Supported delimiters include , . - /, as well as a space.

  • Partial fields can be entered.

  • At this time, only English month names and abbreviations are recognized.

  • All words are case-insensitive; upper and lower case are interpreted the same.

The following table shows different acceptable formats for the date March 7, 1984.

Format

Valid examples

Notes

Y-M-D

1984-03-07; 1984/3/7; 1984.3.07; 1984 Mar 07; 1984-March-7

M-D

03-07; 3/7; Mar-07; March 7

Y-M

1984-03; 1984 March; 1984-Mar

YYYYMMDD

19840307

All 8 digits must be included

M-D-Y

03-07-1984; 3/7/84; March 7 84; Mar. 7, 1984

M-YYYY

03-1984; March 1984; Mar-1984

The year must include 4 digits. March-84 will not be recognized.

D-M-Y

07 03 1984; 7/3/84; 07 March 84; 7/Mar/1984

D-M

07-03; 7/3; 07-Mar; 7 March

D(MONTH)Y

7MAR84; 07March1984

The month is a word or abbreviation

YYYY

1984

Month

March

Compare dates configuration

As with names, you can apply different parameter configurations to a pairwise date match from the Compare page. You can also also modify the weights given to each type of date match to see how each value impacts the final match score.

To view and modify the weightings of date types:

  1. Click Show Configurations.

  2. Click Advanced Configuration.

  3. Change the weightings of the date component matches. You can also choose to disable swap by turning off tryDayMonthSwap.

  4. Apply and Compare.

Changes are applied to the date pair currently displayed; overall date search settings are not modified. Click Reset in the Modified column to set the values back to their default settings.

Note

To save a set of parameter values, click on the copy url icon copy_url.png in the upper right corner. The url contains the parameter values. You must run the Compare before copying the parameter values.

Enable swap

Because dates are sometimes written month day and other times written day month, swap tries matching the date fields as written as well as with the month and date fields switched. The best score is returned as the match score. For example, if the dates in question are 1970-3-5 and 1970-6-4, this feature will match the following four pairs:

1970-3-5

1970-6-4

1970-3-5

1970-4-6

1970-5-3

1970-6-4

1970-5-3

1970-4-6

The maximum score of the four pairs is then returned as the match score.

Turn on tryDayMonthSwap in the Advanced Configuration tab if you think there may be formatting inconsistencies in your dates and the dates you are matching may not always have days and months in the same positions.

If the selected match score is from a swapped pair, a penalty score is applied, indicating less certainty in the match. The displayed Pre Swap Penalty Score is the score returned for the selected swapped pair, before the penalty is applied.

Weighting date matches

The date weighting fields control the relative strength of each aspect of the date-matching algorithm.  A separate score is calculated for each match type. The final match score is calculated by performing a weighted arithmetic mean over each of the similarity scores. If a field is missing from a record, that field is ignored and its weight evenly distributed across other fields.

Table 33. Date weighting parameters

Parameter name

Score based on

Example

timeDistanceWeight 

The number of days in between the two input dates

1979-12-31 and 1980-1-1 look different, but their time difference is very close. They will have a high match score.

yearDistanceWeight

The difference of the year fields

Close years will have a high match score.

monthDistanceWeight

The difference of the month fields 

1 and 12 are far, even if they are close in time. They will have a low match score.

dayDistanceWeight

The difference of the day fields

1 and 30 are far, even if they are close in time. They will have a low match score.

stringDistanceWeight

The edit difference between the two dates, when converted to a standard string (05021974 for 5/2/1974)

1979-12-31 and 1980-1-1 will be 19791231 and 19800101.  They will have a low match score.



Dates with a high time match score may have a very low string match score. Time finds dates that are close together; string gives high scores to similarly formatted dates.

Compare addresses

Match Studio compares addresses by comparing the fields within each address. You can enter the address as a single field or as separate fields.

  1. Select the format of the address: Single or Multi Field. There is also the Extended option for addresses with even more fields. Both address 1 and address 2 must be in the same format.

  2. Enter the addresses.

  3. Compare.

Match Studio optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state are matched using an algorithm similar to name matching. Numeric and alphanumeric fields such as house number, postal code, and unit, are matched using numerically-based methods.

The Match Score Computation displays the match matrix for the address fields. Each field in address 1 is compared with each field in address 2, similar to how name tokens are scored.

address-match.png

Use Show Configurations to apply different parameter configurations or edit individual address matching parameters, respectively, similar to name and date matching.

Supported address fields

Addresses can be defined either as a single field or as a set of address fields. When defined as a single field, the jpostal library is used to parse the address string into address fields.

When entered as a set of fields, the address may include any of the fields below. At least one field must be specified, but no specific fields are required.

Table 34. Supported address fields

Field name

Description

Example(s)

house

venue and building names

"Brooklyn Academy of Music", "Empire State Building"

houseNumber

usually refers to the external (street-facing) building number

"123"

road

street name(s)

"Harrison Avenue"

unit

an apartment, unit, office, lot, or other secondary unit designator

"Apt. 123"

level

expressions indicating a floor number

"3rd Floor", "Ground Floor"

staircase

numbered/lettered staircase

"2"

entrance

numbered/lettered entrance

"front gate"

suburb

usually an unofficial neighborhood name

"Harlem", "South Bronx", "Crown Heights"

cityDistrict

these are usually boroughs or districts within a city that serve some official purpose

"Brooklyn", "Hackney", "Bratislava IV"

city

any human settlement including cities, towns, villages, hamlets, localities, etc.

"Boston"

island

named islands

"Maui"

stateDistrict

usually a second-level administrative division or county

"Saratoga"

state

a first-level administrative division

"Massachusetts"

countryRegion

informal subdivision of a country without any political status

"South/Latin America"

country

sovereign nations and their dependent territories, which have a designated ISO-3166 code

"United States of America"

worldRegion

currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean

"Jamaica, West Indies"

postCode

postal codes used for mail sorting

"02110"

poBox

post office box: typically found in non-physical (mail-only) addresses

"28"



Evaluate

Click Evaluate from the navigation bar to start an accuracy evaluation. This allows you to:

  • Calculate the accuracy of Match using your gold data.

  • Test different parameter configurations to determine the best settings for your data.

  • Determine the best threshold value for your data.

Evaluate requires the following inputs:

  1. A file containing your gold data, which is a set of annotated name match pairs.

  2. A parameter configuration, which a saved collection of parameter values.

Evaluating name matching accuracy

Matching records which include names is a challenging problem because name spellings can differ in so many ways, including simple misspellings, to nicknames, truncations, variable spaces (Mary Ellen, Maryellen), spelling variations, and names written in different languages. Nicknames have a strong cultural component. The terminology itself is problematic because matching implies two things that are the same or equal, but name matching is more about how similar two entities are. Once you have a measure of similarity, you may need additional rules or human analysis to determine if it is a match. It is important to understand these challenges when evaluating name accuracy.

Evaluation data

Data used to measure accuracy should include a wide variety of phenomena that make name matching challenging, including misspellings, aliases or nicknames, initials, and non-Latin scripts. Applying organizational domain knowledge to curating name data that contains specific phenomena found in your real world cases is an ideal starting point for crafting this data set.

Your data for testing accuracy should contain labeled or annotated data. This is often called gold data, referring to the accuracy of the training set's classification for supervised learning techniques. For name matching, it is a list of name pairs, where each pair is labeled as a match or not a match. You can’t calculate accuracy without labeled data. Since assigning classification labels to data can be subjective, you should use multiple annotators on the same data set, determining positive and negative name matches. Establishing a set of annotation guidelines for scoring a classification is necessary, as it provides consistency when classifying the data.

Once you've collected and annotated your gold data, create an evaluation file of name pairs to be imported and used in evaluation.

Evaluation data file

The evaluation data file is a .csv file of annotated name pairs (gold data). It should include both positive (the names are considered a match) and negative (the names are not considered a match) name pairs. All name pairs must be the same entity type.

Table 35. Evaluate import file

Column name

Description

Required?

Example

Name1

First name in the name comparison

Yes

John R. Smith

Name1_Lang

3 letter ISO 693-3 language code

No

eng

Name2

Second name in the name comparison

Yes

Smith John

Name2_Lang

3 letter ISO 693-3 language code

No

eng

Entity_Type

What type of name is this? Person, Organization, Location, Date, or Address

Yes

PERSON

Match

Does Name1 match Name2?

Yes

Y



The first row of the file is a header row, containing the column names of the fields in the file. Each column is separated by a comma; if a value is not provided, that field is left blank but the column must still be included.

Sample file - PERSON

NAME1,NAME1_LANG,NAME2,NAME2_LANG,ENTITY_TYPE,MATCH
Peter Harding,eng,Pete Harding,eng,PERSON,Y
Peter Harding,eng,Harding Peter,eng,PERSON,Y
Peter Harding,,Pete Michael Harding,eng,PERSON,Y
Peter Harding,eng,P. M. Harding,eng,PERSON,N
Peter Harding,eng,Pat Harding,,PERSON,N
Peter Harding,eng,P. B. Harding,eng,PERSON,N
Peter Harding,eng,Pietro Hardin,eng,PERSON,N

Sample file - ADDRESS

123 Fake Street Springfield MO,,123 Fake St Springfield IL,,ADDRESS,N
820 Forest Road,,820 Forrest Rd,,ADDRESS,Y

To upload an evaluation data file:

  1. Click Evaluate from the navigation bar.

  2. Drag or browse for the desired evaluation data file. When it has finished uploading, it will appear in the file list.

Measuring accuracy

Precision, recall, and F1 score are metrics used to evaluate NLP tools. Accuracy is measured as a combination of the three values.

  • Precision answers the question "of the answers you found, what percentage were correct?" Precision is sensitive to false positives; higher is more precise.

  • Recall answers the question "of all possible correct answers, what percentage did you find?" Recall is sensitive to false negatives; higher is better recall.

  • F1 measure is the harmonic mean of precision and recall. The F1 measure is sensitive to both false positives and false negatives; a higher value means better accuracy. It isn't quite an average of the two scores, as it penalizes the case where the precision or recall scores are far apart. For example, if the system finds 10 answers that are correct (high precision), but misses 1,000 correct answers (low recall), you wouldn't want the F1 measure to be misleadingly high.

Calculating precision, recall, and F1

Let's look at how precision, recall, and F1-score are calculated. By comparing actual match results with gold data, we can calculate:

  • TP: True positives. Number of match pairs that scored above the match threshold, and were labeled as matches in the gold data.

  • FP: False positives. Number of match pairs that scored above the match threshold, but were labeled as non-matches in the gold data.

  • TN: True negatives. Number of match pairs that scored below the match threshold, and were labeled as non-matches in the gold data.

  • FN: False negatives. Number of match pairs that scored below the match threshold, but were labeled as matches in the gold data.

Pairs: Number of rows in the gold data file; the number of name pairs compared = TP + FP

Matches: Number of records that are marked as matches in the gold data file. = TP + FN

Precision is an indication of how many of the matches are correct. If there are 2 correct matches, but 6 were identified as matches, P = .33. The name pair is correctly matched 1/3 of the time. If there are no false positives, the precision is 1.

Equation 1. 
precision=TPTP+FP=TPPairs=Pprecision=\frac{TP}{TP+FP}=\frac{TP}{Pairs}=P


Recall is an indication of how many of the correct matches were found. If 2 pairs are identified as matches, but there are 4 pairs that are actual matches, R = .5. This means that Match Studio found the correct match 1/2 the time. If there are no false negatives, the recall is 1.

Equation 2. 
recall=TP(TP+FN)=TPMatches=Rrecall=\frac{TP}{\left(TP+FN\right)}=\frac{TP}{Matches}=R


F1-score is the harmonic mean of precision and recall

Equation 3. 
F1=2(PR)(P+R)F1=2\cdot\frac{\left(P\cdot R\right)}{\left(P+R\right)}


A major benefit of Match Studio is that you can define a threshold for name matching, optimizing for what is most relevant to your use case. For name matching, consider the case where a query name is expected to match one and only one name in the index. If there are three names returned above the threshold, including the correct match, then one name is a true positive (TP), two names are false positives (FP), and there are no false negatives (FN). If the correct match is not returned above the threshold, the number of false negatives will be one.

In this example, where the correct match is returned along with two other matches:

  • The precision is 1/3, since there is one correct match returned with two other matches.

  • The recall is 1.0, since there are no false negatives.

New evaluation

Creates a new evaluation to run Match against a data set along with a selected configuration, helping you configure Match for your environment.

Before you can run a new evaluation, you must:

  1. Click Evaluate from the navigation bar if you are not already on the Accuracy Evaluations screen.

  2. Select New Evaluation in the Options column for the gold data file you want to use.

  3. Verify that you have the desired evaluation data file selected from the drop-down menu.

  4. Select the desired parameter configuration from the drop-down menu.

  5. Start Evaluation.

View evaluation data

Select View Evaluations to display all evaluations performed for a given evaluation data file. Use the Display settings to view the evaluation results for the best threshold, or for a specific threshold. Once you have specified the threshold, you can download the evaluation results for that threshold. Use Threshold report to visualize how the results change as the threshold increases or decreases.

Threshold report

A key feature of Match is that it returns a normalized score between 0 and 1 indicating how similar two names are. This makes integrating Match into existing workflows powerful. However, it leads to the inevitable question: “What threshold should I use?” Without quantitative error analysis, this question is difficult to answer. Match Studio determines the correct value by processing your evaluation data against various thresholds, calculating precision, recall, and F1 for each data type.

The threshold report graphs the precision, recall, and F1 for each threshold value and each data type. As you can see from the graph, the threshold can be adjusted to favor precision or recall. With a high threshold, false positives will be less likely, leading to a higher precision. With a low threshold, false negatives will be less likely, leading to a higher recall. For applications involving border checking, you may be inclined to favor recall to avoid allowing potential threats into a country. Conversely, for Know Your Customer(KYC) applications, leaning your threshold towards precision may be best to reduce false positives. Your choice of threshold should ultimately be based on an analysis of accuracy as a function of threshold for your particular data, as well as your business requirements.

Select Threshold Report from the Options menu for the specific Parameter Configuration.

eval_options.png

The threshold report page contains a table listing the Precision, Recall, and F1 values for each data type, along with the threshold value, along with a threshold graph. The threshold is the value which produces the best F1 score.

Figure 1. Threshold report graph
Threshold report graph


From this page, you can download the following reports as .csv files:

  • Threshold summary, listing the TP, TN, FP, FN, precision, recall, and F-scores for each threshold and entity type.

  • Detailed match results, listing the Match score for each pair, along with the result (TP, TN, FP, or FN) for each threshold value.

Configure

Note

If you are using Match Studio in locked mode, configurations are read-only, and you will not be able to edit them or create new ones. This comes with performance gains, since Match's dynamic configuration endpoints do not have to be enabled. Overrides and stop words are not editable or viewable in Locked Mode.

Hover over the Configure tab on the navigation bar to see configuration options. The options are:

  • Indices: Add or remove indices, configure indices, or search an index.

  • Parameters: Adjust the parameters that determine the match score between two tokens. Sets of parameter values can be saved as parameter configurations and applied to indices.

  • Overrides: Add or remove token pairs that are considered a match regardless of similarity.

  • Stop Words: Add or remove words that are ignored during string matching so that they do not affect the match score.

Create index

Note

If you are using Match Studio in locked mode, indices are read-only. Some features, such as creating new indices within Match Studio and editing existing indices, are only available in unlocked mode. You can, however, link indices from a connected server.

Before searching with Match Studio, you must create an index by uploading a recordset containing your searchable data.

Match Studio imports structured data. Supported file formats are:

  • csv

  • tsv

  • xml

  • json

For .csv files, the first row must be a header row, containing the names of the fields in the source file. For other file types the key names are the field names. The field names must be unique. For .json and .xml files, multi-value fields are automatically detected. For .csv and .tsv files, multi-valued fields must use a delimiter to separate values (for example, John|Jon|Jonathan).

To create a new index:

  1. Hover over Configure on the navigation bar and select Indices.

    The list of existing indices is displayed.

  2. Click Create an Index.

  3. Follow the instructions on the Import Index wizard.

As part of the process of creating a new index, you must select the delimiter used to separate fields in the data file.

You will also have to map fields to the columns in the recordset. For more information, see Mapping.

Mapping

Mapping is the process of assigning data types to the columns in your dataset. Each column must have a data type assigned to it.

The mapping process is different depending on the data file format.

Mapping for .csv and .tsv

Review the following when mapping fields to your index:

  • Import: Deselect any fields you do not want to import into Match Studio. Fields not imported will not be used for matching, nor will they appear in search results.

  • Data Field: The name of the field as it appears in the index file.

  • Display Name: In this column, enter the name of the field as you want it to appear in Match Studio.

  • Searchable: When this is enabled, the field will be considered when determining a match. When disabled, the field will not be considered when determining a match, but will still appear in search results.

  • Grouped Field: Enable this if the data file contains one of the following group field types:

    • Name split into first name, middle name, and last name in separate columns.

    • Date split into day, month, and year

    • Address split into different fields supported by Match Address matching

    • Multi-value grouped fields

    For more information, see Grouped fields.

  • Data Type: For fields used for match, provide the data type to tell Match Studio how it should match this field. For more information, see Data types.

    • For fields that use the Name - General (Match) data type, you must also specify the values used for the Name Type (Match) field later in the wizard.

  • Multi-Value: Enable this if the data file contains fields with multiple values. Such fields with multiple aliases must separate values with a delimiter. You will be prompted to specify the type of delimiter used later in the wizard.

map_csv.png
Mapping for .json and .xml

Enable a field by selecting its corresponding checkbox on the left side of the screen. Only enabled fields will be imported to Match Studio. Disabled fields will not be used for matching, nor will they appear in search results. For each field, the following options are available:

  • Display Name: In this column, enter the name of the field as you want it to appear in Match Studio.

  • Searchable: When this is enabled, the field will be considered when determining a match. When disabled, the field will not be considered when determining a match, but will still appear in search results.

  • Data Type: For fields used for match, provide the data type to tell Match Studio how it should match this field. For more information, see Data types.

    • For fields that use the Name - General (Match) data type, you must also specify the values used for the Name Type (Match) field on the next step of the wizard.

Grouped fields

Match Studio allows for fields to be grouped together when multiple fields in an index map to a single field for searching. For example, if an index contains separate fields for the first name, middle name, and last name, these fields should be mapped to a single full name field for improved matching. The same is true for an index that separates dates into month, day, and year fields.

When mapping an index that contains group fields, you will have to do a bit of setup to create those groups. After the initial mapping step of the wizard, you will reach the Create Field Groups step.

  1. Enter a name for the grouped field under Combined Field Name. For example, for a group consisting of first, middle, and last name, you might name the grouped field "Full Name".

    group1.png
  2. Select the appropriate field type under Combined Field Type. For example, for a group consisting of first, middle, and last name, you should select Name - Person.

  3. Select the checkbox next to all items in the Items Available box which should be included in the group.

  4. Add.

  5. Create Group.

  6. Map each field included in the group. For example, for a group consisting of first, middle, and last name, you should map them to Name Field 1, Name Field 2, and Name Field 3, respectively.

    Tip

    If the data contains a title field, it should be indexed as 1 with the other fields indexed as 2, 3, and 4.

    group2.png
  7. Save.

  8. Create.

Data types

The following data types are predefined in Match Studio and can be selected in the mapping definition.

Table 36. Data types

Data type

Description

Name - Person (Match)

The name, nickname, or alias of an individual.

Name - Organization (Match)

The name of a corporation, institution, government agency, or other group of people defined by an established organizational structure.

Name - Location (Match)

The name of a geographic location such as a city, state, country, region, mountain, park, lake, or address.

Name - General (Match)

A name that is not specified as a Person, Organization, or Location. Indices that use this data type must also use the Name Type (Match) data type.

Name Type (Match)

The value used to specify whether a Name - General (Match) field is a Person, Organization, or Location entity. Indices that use this data type must also use the Name - General (Match) data type.

Date (Match)

A date contains a year, month, and day. All common delimiters for English dates are supported. Dates can be expressed in various orderings, and months can be written as a numeral, their full English name, or the common three-letter abbreviation.

Address (Match)

A postal address of a location.

KEYWORD

Structured content such as an ID, email address, or zip code.

TEXT

Unstructured full-text content such as a description.

INTEGER

A signed 32-bit integer.

DOUBLE

A double-precision 64-bit IEEE 754 floating point number, restricted to finite values.

FLOAT

A single-precision 32-bit IEEE 754 floating point number, restricted to finite values.

BOOL

Boolean, true or false.

LONG

A signed 64-bit integer.

SHORT

A signed 16-bit integer.



 

Configure index

The Configure Index section allows you to change how your data can be searched and how the results are displayed.

To access the Configure Index section:

  1. Hover over the Configure tab on the navigation bar and select Indices.

  2. Select the Configure (gear) icon next to the desired index.

Data

Note

This feature is not available in locked mode.

The Data tab displays information about the index, such as when it was created and which file was used to create it. This can be useful if you need more information to distinguish similar indices.

Import Data

This function allows you to add new entries to an existing index. The source file for the new data must contain all the imported columns from the original index, with identical data fields. Additional columns will be ignored.

Layout

This tab controls how fields are displayed on the Search page.

  • Data Field: The name of the data field as it appears in the index file.

  • Display Name: How the data field name appears in Match Studio

  • Data Type: The data type of the field.

  • Position: Click and drag the icon in this column to rearrange the order in which the fields are displayed. Fields at the top of this list will appear on the far left of the results, while fields on the bottom will appear on the far right.

Configurations

Use this tab to control the matching behavior of an index, including the weight of each field and the details of how match scores are calculated.

configure_index.png
Matching weights

A data field's weight value represents the magnitude of its impact on the final match score. When determining a match, some fields are more important than others. For example, the person name is likely more important in determining a match between two people than the location name. Adjust the weight slider for each field based on its relative importance.

Weight is distributed equally among all fields by default. If a field is missing from a record, that field is ignored and its weight evenly distributed across other fields.

Parameter configuration

Individual name tokens are scored by a number of algorithms. These algorithms can be optimized by modifying parameters, thus changing the final match score. A parameter configuration contains a set of parameter values for one or more specific language pairs and entity types.

Use the Parameter Configuration dropdown menu to set the default parameter configuration for search and batch search. You can use the default parameter configuration (Match-Studio-<version> Default) or create a new parameter configuration. See New parameter configuration for more information on creating a new parameter configuration.

New parameter configurations can also be selected directly from the search page.

Window size

Each Match Studio query is processed in two passes to provide the best combination of speed and accuracy.

  1. The first pass is designed to quickly generate a set of candidates for the second pass to consider.

  2. The second pass compares every value returned by the first pass against the value in the query and computes a similarity score. Multiple scorers are applied in the second pass, to generate the best possible score.

Window size determines the number of scores moved along to the second pass. Increasing window size improves recall, but results in a slower search.

Window size can also be adjusted directly from the search page.

Match threshold

Once the match score is calculated for all values in the index, those with scores greater than or equal to the match threshold are highlighted in the search results.

Display threshold

Once the match score is calculated for all values in the index, only those with scores greater than or equal to the display threshold are returned in the search results. If you aren't seeing results you expect, try lowering the display threshold value to return more results.

Parameters

Hover over the Configure tab and select Parameters to view the Parameters page.

parameters.png

Individual name tokens are scored by a number of algorithms. These algorithms can be optimized by modifying configuration parameters, thus changing the final match score. Match Studio provides an easy way to compare and test different parameter settings based on your data.

A parameter configuration contains a set of parameter values for one or more specific language pairs and entity types.

The Configure page contains a list of existing parameter configuration files. You can:

  • Export the parameter configuration as a .yaml file. This file can be re-imported into Match Studio or used as a reference for making a configuration in Match. This option is disabled for the default parameter configuration.

    Important

    The exported .yaml file cannot be imported directly into Match.

  • Configure (or edit) an existing parameter configuration. You cannot edit a parameter configuration if an evaluation has already been created using it. This option is disabled for the default parameter configuration.

  • Delete a parameter configuration. This option is disabled for the default configuration.

  • Create a New Configuration.

  • Import Configuration from valid .yaml file.

List of parameters

Parameters are separated into common and advanced parameters. Match contains over 100 parameters, but the common parameters (listed in the table below) are the most likely to have an impact on your data. The advanced parameters, on the other hand, have default values which are already tuned to perform well on most queries and datasets. Users should exercise caution when modifying advanced parameters.

Tip

When adjusting parameters, if you encounter one you are unfamiliar with, hover over the parameter name for a description of the parameter.

Parameter name

Description

Behavior

initialsScore

The score that is assigned to an initial matching a token.

Increasing leads to higher final score when there is a match between an initial and a token.

deletionScore

The score applied to an unmatched token when surrounding tokens are matched.

Increasing leads to higher final score when there is an unmatched token.

reorderPenalty

A penalty applied when two tokens match, but are in different positions in each name.

Increasing leads to lower final score when the tokens aren't in the same position.

conflictScore

The score that is assigned to unmatched conflict tokens.

Increasing leads to higher final score.

boostWeightAtRightEnd

A boost applied to the final score applied to the weights of tokens at the right end of the name.

Increasing leads to a higher score for matches between the surnames in English and other languages where the end of the name is more meaningful. It makes the last token more important.

initialsConflictScore

The score that overrides the usual conflict score when initials conflict.

Increasing leads to higher final score when initials conflict.

initialismScore

The score that gets assigned to an initialism matching a name.

Increasing leads to a higher score for matches between initialisms and names.

stuckInitialScore

The score that gets applied when an initial is "stuck" to the preceding token.

Increasing leads to a higher score when there is a stuck initial.

outOfOrderDeletionScore

The score that gets assigned to an unmatched token which, when removed, leaves the remaining tokens out of order.

Increasing leads to a higher score when there is an out of order deletion.

initialsDeletionPenalty

Adjusts the usual deletion score when initials are deleted by multiplying by this amount.

Increasing leads to a higher score when initials are deleted.

genderConflictPenalty

A penalty applied when the gender of the names doesn't match.

Increasing leads to a lower score when the genders of the names do not match.

crossLanguageGenderConflictPenalty

A penalty applied for cross language name matching. This should usually be lower than the Gender Conflict Penalty because some first names appear as different genders in different languages.

Increasing leads to a lower score when the genders of the names do not match across languages.

boostWeightAtBothEnds

A boost applied when the tokens at the beginning and end match. If set too high, middle names may not be matched correctly or may be ignored.

Increasing leads to a higher score when tokens at the beginning and end match.

adjustOneSidedDeletionScores

Multiplies token scores of deleted tokens by this amount if they only occur in only one of the names (but more than one token remains).

Increasing leads to higher scores for deleted tokens if they only occur in one name and more than one token remains.

reorderCorrection

Adjustment made when the tokens in one of the names have been rotated with respect to the other (e.g. A B C D vs D A B C).

Increasing leads to a higher score when tokens have been rotated.

finalBias

This is used to normalize the scores so that they have roughly the same value from one release to the next, as well as between language pairs.

Increasing leads to a higher final score.

Configuring language pairs and types

When editing a parameter configuration, you can edit parameters for all language pairs and entity types, or for specific language pairs and entity types. Control which pairs and types you are configuring using the tree in the upper-left corner of the View Parameters page.

config_pairs.png

If you want to edit parameters for a specific language pair and entity type that is not included in the tree, you can add it using the Add New Pair section in the lower-left corner of the View Parameters page.

When you edit a parameter in your parameter configuration, that parameter's value will be inherited by any language-pair-specific or type-specific parameters "below" it in the hierarchy. For example, if you change adjustOneSidedDeletionScores to "1" for all language pairs and types, the adjustOneSidedDeletionScores parameter will also be set to "1" for English to English matching, and that parameter will be marked as "Inherited" when viewing it in the context of that language pair. Select Jump To to go where the value is inherited from.

inherited.png

If you then set that parameter to "1.2" for English to English Person matching, the parameter value will not change for any other language pairs or types. The parameter will be marked as "Local" to indicate that the value was specified for this language pair and type, and not inherited from anywhere. Click Reset to reset the parameter to the value it would inherit from "above" in the hierarchy.

local.png

Note

Specific language profiles inherit values from global profiles. If there is a value in a more specific profile, it will not inherit the global value. Lower level (more specific) profiles always take precedence over global values.

Only values set in Match Studio are marked as local. If lower level values are set in the parameters file, those values will take precedence, but won't be marked as local.

New parameter configuration
  1. Select New Configuration.

  2. Select an existing parameter configuration to serve as a starting point for the new configuration.

  3. Name the new parameter configuration.

  4. Select Create.

Import parameter configuration

You can create a new parameter configuration by importing a valid import file. A parameter configuration created in this way will have all parameters defined in the file, including parameters not exposed in the Match Studio UI.

Note

You must have a valid import file. A valid import file is in .yaml format and follows the structure of a Match-ES parameter configuration file. You can create a valid import file by exporting the default configuration and editing it by adding or modifying parameters.

  1. Hover over Configure from the navigation bar.

  2. Select Parameters.

  3. Click Import Configuration.

  4. Name the parameter configuration.

    Note

    If the name of an existing configuration is provided and the configuration is editable, its values will be overwritten by the import. If the name of an existing configuration is provided and the configuration is not editable, the import will fail. Otherwise, a new parameter configuration will be created with values from the imported file.

    A parameter configuration is not editable if an evaluation has already been created using the configuration.

  5. Select OK. The new parameter configuration appears in the parameter configuration list.

Overrides

Note

Overrides and stop words are not editable or viewable in locked mode.

Hover over the Configure tab and select Overrides to view the Overrides page. This page allows you to add or remove token pairs that are considered a match regardless of similarity. You can use this list for things like names with unusual nicknames, such as "Margaret" and "Peggy."

An override pair can be one of four types, each of which has a different behavior:

  • NICKNAME - A shortened, familiar, or informal version of a name, as Pete is to Peter (Default)

  • COGNATE - Names that share the same root, but may not be commonly used as a nickname, such as James and Diego or Pedro and Peter

  • VARIANT - Different spellings of the same name, such as Mohammed, Muhamet, and Mohammad

  • SUPPRESS - Pairs of names that you have decided shouldn't match that otherwise would (think of this as as the opposite of an override)

To add a new override set:

  1. Click New Set.

  2. Select the language pair. This is the language of each token in the pair. A pair can be two tokens in the same language.

  3. Enter the name and language of the first token and click Next.

  4. For each token in the second language in the pair, enter the token name, override type, and select an entity type. When you are finished, click Save.

To find an override set:

  1. Select the language pair for the override.

  2. Search for a token that is part of that set in the Enter Keyword field in the upper right.

To delete an override set:

  1. Find the override set using the procedure above.

  2. Select the vertical ellpisis and select Delete Set.

To add a token to an existing override set:

  1. Find the override set using the procedure above.

  2. Select the vertical ellpisis and select Add Token.

  3. Enter the token name, type, and entity type, then click Save.

To remove a token from an existing override set:

  1. Find the override set using the procedure above.

  2. Select the caret icon to expand the override set.

  3. Select the x icon next to the token you would like to remove from the set, then click Save.

Entity types

Matching works differently for different entity types. It is therefore important to consider which entity type(s) an override applies to. The possible entity types are:

  • Person: The name, nickname, or alias of an individual.

  • Organization: The name of a corporation, institution, government agency, or other group of people defined by an established organizational structure.

  • Location: The name of a geographic location such as a city, state, country, region, mountain, park, lake, or address.

Stop words

Note

Overrides and stop words are not editable or viewable in Locked Mode.

Hover over the Configure tab and select Stop Words to view the Stop Words page. This page allows you to add or remove words that are ignored during string matching so that they do not affect the match score. You can use this list for parts of names or titles that are not significant when determining a match, such as "Mr." or "Prime Minister."

You can create new stop words as prefixes to names. To add a new prefix stop word:

  1. Click New Stop Word.

  2. Enter the language, prefix name, and entity type.

  3. Select Save or Save and Add Another.

To find a stop word:

  1. Select the stop word's language.

  2. Search for the stop word using the Enter Keyword field.

To delete a stop word:

  1. Find the stop word using the procedure above.

  2. Click the x icon in the Delete column for that stop word.

Servers

The server is what connects Match Studio to Match-ES, which powers its matching and indexing capabilities. By default, Match Studio connects to the Match Studio Elasticsearch Server. You can also choose to connect to a different external Elasticsearch or OpenSearch server. This is useful if you want to use a different version of Match-ES, Match-OS, or utilize an existing server you already have. (To use Match Studio with Match-OS, certain permissions must be set up.) Match Studio checks the server connection status every 30 seconds.

Note

If Match Studio is in locked mode, it must be connected to a different external Elasticsearch server.

Hover over the server status icon on the right of the navigation bar and select Configure Servers to access the Configure Servers page, where you can switch between servers and add new servers. You can only be connected to one server at a time.

Additional servers are supported for Match-ES version 8.12.2.

Required Elasticsearch parameters

For servers without authentication, the following parameters must be set in elasticsearch-<version>/config/elasticsearch.yml:

  • network.host: 0.0.0.0

  • http.max_content_length: 400mb

  • xpack.security.enabled: false

For servers with basic authentication, the following parameters must be set in elasticsearch-<version>/config/elasticsearch.yml:

  • network.host: 0.0.0.0

  • http.max_content_length: 400mb

  • xpack.security.enabled: true

For all servers, the following parameters must be set in elasticsearch-<version>/plugins/babel-street-match/bt_root/rlpnc/data/etc/parameter_defs.yaml:

  • enableDynamicConfigurationEndpoints:

    • type: boolean

    • static: true

    • default: true

Set up required OpenSearch permissions

Match Studio requires certain permissions in order to function. Using Match Studio with Match-OS necessitates a bit of extra setup regarding these permissions.

  1. Create a new OpenSearch user with a password of sufficient complexity. For example, to create a user called MatchStudioUser with the password qwerQWER!@#$1234, make the following REST call:

    curl --insecure -u admin:admin --location --request PUT \
      'https://localhost:9400/_plugins/_security/api/internalusers/MatchStudioUser' \
      --header 'Content-Type: application/json' \
      --data-raw '{ "password": "qwerQWER!@#$1234" }'
  2. Create a new role with the required permissions. For example, to create a role called MatchStudioRole, make the following REST call:

    curl --insecure -u admin:admin --location --request PUT \
      'https://localhost:9400/_plugins/_security/api/roles/MatchStudioRole' \
      --header 'Content-Type: application/json' \
      --data '{
        "cluster_permissions": [
          "cluster:monitor/main",
          "cluster:monitor/health",
          "cluster:monitor/state",
          "cluster:monitor/nodes/info",
          "indices:data/read/scroll*"
        ],
        "index_permissions": [{
          "index_patterns": ["*"],
          "allowed_actions": [
              "indices:data/*",
              "indices:monitor/settings/get",
              "indices:monitor/stats",
              "indices:admin/get",
              "indices:admin/create",
              "indices:admin/mapping/put",
              "indices:admin/mappings/get",
              "indices:admin/delete",
              "indices:admin/refresh*"
          ]
        }]
      }'
  3. Create the role mapping using the the user name from step 1 and the role name from step 2. For example, to create a role mapping for user MatchStudioUser and role MatchStudioRole, make the following REST call:

    curl --insecure -u admin:admin --location --request PUT \
      'https://localhost:9400/_plugins/_security/api/rolesmapping/MatchStudioRole' \
      --header 'Content-Type: application/json' \
      --data '{ "users" : [ "MatchStudioUser" ] }'

Add external server

To add a new external server:

  1. Hover over the server status icon on the navigation bar and then select Configure Servers. The Configure Servers page appears.

  2. Click Add Server. The Add Server window appears.

  3. Name the server.

  4. Enter the server address. This can be an IP address or a URL.

  5. Enter the server port. This is usually 9200.

  6. Select the server connection type (HTTP or HTTPS).

  7. Select authentication type (No Authentication, Basic Authentication, or API Key Authentication).

  8. If you select Basic Authentication, enter the user name and password. As an additional layer of security, you can also save your username and password as environmental variables, and enter them here.

  9. Next.

  10. Wait for Match Studio to verify the connection and click Close. (If Match Studio is in Locked Mode, the button will instead read Proceed to Import Indices.)

After you finish adding the server, Match Studio will automatically switch to it, and it will become visible on the Configure Servers page. Any parameter configurations will automatically be imported into Match Studio. If you are adding a server to Match Studio in locked mode, you will next be asked to link the desired indices from the new server.

Connect to a server

If there are external servers added to your instance of Match Studio, you can connect to them at any time by completing the following steps:

  1. Hover over the server status icon on the the navigation bar and select Configure Servers.

  2. Hovering over the vertical ellipsis in the Action column for the server you wish to connect to and select Connect.

Cross-language matches

This table identifies the range of cross-language searching and matching that Match fully supports. If your query is a name in an Arabic document in Arabic script, the query may return one or more names in English documents in Latin script, in addition to names from Arabic documents in Arabic script. If the query is a name in English and Latin script, it may return documents from any of the supported languages and their native scripts.

Video tutorials

  • Installation (Windows)

     
  • Installation (Mac)

     
  • Score matrix

     
  • Perform a seach